Abstract. We present a new analysis of the problem of learning with drifting
distributions in the batch setting using the notion of discrepancy. We prove learning
bounds based on the Rademacher complexity of the hypothesis set and the
discrepancy of distributions both for a drifting PAC scenario and a tracking scenario.
Our bounds are always tighter and in some cases substantially improve
upon previous ones based on the $L_1$ distance. We also present a generalization
of the standard on-line to batch conversion to the drifting scenario in terms of
the discrepancy and arbitrary convex combinations of hypotheses. We introduce
a new algorithm exploiting these learning guarantees, which we show can be
formulated as a simple QP. Finally, we report the results of preliminary experiments
demonstrating the benefits of this algorithm.
1 Introduction

In the standard PAC model [1] and other similar theoretical models of learning [2], the 
distribution according to which training and test points are drawn is fixed over time. 
However, for many tasks such as spam detection, political sentiment analysis, financial 
market prediction under mildly fluctuating economic conditions, or news stories, the 
learning environment is not stationary and there is a continuous drift of its parameters 
over time. 

There is a large body of literature devoted to the study of related problems both in 
the on-line and the batch learning scenarios. In the on-line scenario, the target function 
is typically assumed to be fixed but no distributional assumption is made, thus input 
points may be chosen adversarially [3]. Variants of this model where the target is al- 
lowed to change a fixed number of times have also been studied [3, 4, 5, 6]. In the 
batch scenario, the case of a fixed input distribution with a drifting target was originally 
studied by Helmbold and Long [7]. A more general scenario was introduced by Bartlett 
[8] where the joint distribution over the input and labels could drift over time under the
assumption that the $L_1$ distance between the distributions in two consecutive time steps
was bounded by $\Delta$. Both generalization bounds and lower bounds have been given for
this scenario [9, 10]. In particular, Long [9] showed that if the $L_1$ distance between
two consecutive distributions is at most $\Delta$, then a generalization error of $O((d\Delta)^{1/3})$
is achievable and Barve and Long [10] proved this bound to be tight. Further improvements
were presented by Freund and Mansour [11] under the assumption of a constant
rate of change for drifting. Other settings allowing arbitrary but infrequent changes of
the target have also been studied [12]. An intermediate model of drift based on a near 
relationship was also recently introduced and analyzed by [13] where consecutive dis- 
tributions may change arbitrarily, modulo the restriction that the region of disagreement 
between nearby functions would only be assigned limited distribution mass at any time. 

This paper deals with the analysis of learning in the presence of drifting distributions 
in the batch setting. We consider both the general drift model introduced by [8] and a 
related drifting PAC model that we will later describe. We present new generalization 
bounds for both models (Sections 3 and 4). Whereas previous authors measured the
distance between distributions with the $L_1$ distance, our bounds are based on a notion
of discrepancy between distributions generalizing the definition originally introduced
by [14] in the context of domain adaptation. The $L_1$ distance used in previous analyses
admits several drawbacks: in general, it can be very large, even in favorable learning 
scenarios; it ignores the loss function and the hypothesis set used; and it cannot be 
accurately and efficiently estimated from finite samples (see for example lower bounds 
on the sample complexity of testing closeness by [15]). In contrast, the discrepancy 
takes into consideration both the loss function and the hypothesis set. 

The learning bounds we present in Sections 3 and 4 are tighter than previous bounds 
both because they are given in terms of the discrepancy, which lower bounds the $L_1$
distance, and because they are given in terms of the Rademacher complexity instead of the
VC-dimension. Additionally, our proofs are often simpler and more concise. We also 
present a generalization of the standard on-line to batch conversion to the scenario of 
drifting distributions in terms of the discrepancy measure (Section 5). Our guarantees 
hold for convex combinations of the hypotheses generated by an on-line learning algo- 
rithm. These bounds lead to the definition of a natural meta-algorithm which consists of 
selecting the convex combination of weights in order to minimize the discrepancy-based 
learning bound (Section 6). We show that this optimization problem can be formulated 
as a simple QP and report the results of preliminary experiments demonstrating its ben- 
efits. Finally we will discuss the practicality of our algorithm in some natural scenarios. 

2 Preliminaries 

In this section, we introduce some preliminary notation and key definitions, including 
that of the discrepancy between distributions, and describe the learning scenarios we 
consider. 

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. We consider a loss function
$L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ bounded by some constant $M > 0$. For any two functions
$h, h'\colon \mathcal{X} \to \mathcal{Y}$ and any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, we denote by $\mathcal{L}_D(h)$ the
expected loss of $h$ and by $\mathcal{L}_D(h, h')$ the expected loss of $h$ with respect to $h'$:

$$\mathcal{L}_D(h) = \mathbb{E}_{(x,y) \sim D}[L(h(x), y)] \quad \text{and} \quad \mathcal{L}_D(h, h') = \mathbb{E}_{x \sim D^1}[L(h(x), h'(x))], \tag{1}$$

where $D^1$ is the marginal distribution over $\mathcal{X}$ derived from $D$. We adopt the standard
definition of the empirical Rademacher complexity, but we will need the following
sequential definition of a Rademacher complexity, which is related to that of [16].



Definition 1. Let $G$ be a family of functions mapping from a set $Z$ to $\mathbb{R}$ and
$S = (z_1, \ldots, z_T)$ a fixed sample of size $T$ with elements in $Z$. The empirical Rademacher
complexity of $G$ for the sample $S$ is defined by:

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{T} \sum_{t=1}^{T} \sigma_t g(z_t)\Big], \tag{2}$$

where $\sigma = (\sigma_1, \ldots, \sigma_T)^\top$, with the $\sigma_t$s independent uniform random variables taking
values in $\{-1, +1\}$. The Rademacher complexity of $G$ is the expectation of $\widehat{\mathfrak{R}}_S(G)$
over all samples $S = (z_1, \ldots, z_T)$ of size $T$ drawn according to the product distribution
$D = \bigotimes_{t=1}^{T} D_t$:

$$\mathfrak{R}_T(G) = \mathbb{E}_{S \sim D}\big[\widehat{\mathfrak{R}}_S(G)\big]. \tag{3}$$

Note that this coincides with the standard Rademacher complexity when the distributions
$D_t$, $t \in [1, T]$, all coincide.
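As a rough illustration of equation (2), the empirical Rademacher complexity of a finite family can be approximated by averaging over random sign vectors. The sketch below is only a Monte Carlo approximation under assumptions that are not part of the paper: the family is finite and is passed as a table of values `function_values[i][t]` standing for $g_i(z_t)$, and the names `empirical_rademacher`, `num_draws`, and `seed` are hypothetical.

```python
import random

def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_g (1/T) sum_t sigma_t g(z_t) ].

    function_values[i][t] holds g_i(z_t) for the i-th member of a finite
    family G (an assumption of this sketch) on a fixed sample of size T.
    """
    rng = random.Random(seed)
    T = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        # One draw of independent uniform signs in {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(T)]
        # Supremum over the finite family of the signed average.
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / T
            for values in function_values
        )
    return total / num_draws
```

Since the sample is held fixed, the estimate does not depend on whether the points were drawn from identical or drifting distributions; the drift only enters through the outer expectation in (3).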

A key question for the analysis of learning with a drifting scenario is a measure of 
the difference between two distributions $D$ and $D'$. The distance used by previous
authors is the $L_1$ distance. However, the $L_1$ distance is not helpful in this context since it
can be large even in some rather favorable situations. Moreover, the $L_1$ distance cannot
be accurately and efficiently estimated from finite samples and it ignores the loss function
used. Thus, we will adopt instead the discrepancy, which provides a measure of
the dissimilarity of two distributions that takes into consideration both the loss function 
and the hypothesis set used, and that is suitable to the specific scenario of drifting. 

Our definition of discrepancy is a generalization to the drifting context of the one 
introduced by [14] for the analysis of domain adaptation. Observe that for a fixed
hypothesis $h \in H$, the quantity of interest with drifting distributions is the difference of
the expected losses $\mathcal{L}_{D'}(h) - \mathcal{L}_D(h)$ for two consecutive distributions $D$ and $D'$. A
natural distance between distributions in this context is thus one based on the supremum
of this quantity over all $h \in H$.

Definition 2. Given a hypothesis set $H$ and a loss function $L$, the $\mathcal{Y}$-discrepancy $\mathrm{disc}_{\mathcal{Y}}$
between two distributions $D$ and $D'$ over $\mathcal{X} \times \mathcal{Y}$ is defined by:

$$\mathrm{disc}_{\mathcal{Y}}(D, D') = \sup_{h \in H} \big|\mathcal{L}_{D'}(h) - \mathcal{L}_D(h)\big|. \tag{4}$$

In a deterministic learning scenario with a labeling function $f$, the previous definition
becomes

$$\mathrm{disc}_{\mathcal{Y}}(D, D') = \sup_{h \in H} \big|\mathcal{L}_{D'^1}(f, h) - \mathcal{L}_{D^1}(f, h)\big|, \tag{5}$$

where $D'^1$ and $D^1$ are the marginal distributions associated to $D'$ and $D$ defined over
$\mathcal{X}$. The target function $f$ is unknown and could match any hypothesis $h'$. This leads to
the following definition [14].

Definition 3. Given a hypothesis set $H$ and a loss function $L$, the discrepancy $\mathrm{disc}$
between two distributions $D$ and $D'$ over $\mathcal{X} \times \mathcal{Y}$ is defined by:

$$\mathrm{disc}(D, D') = \sup_{h, h' \in H} \big|\mathcal{L}_{D'^1}(h', h) - \mathcal{L}_{D^1}(h', h)\big|. \tag{6}$$



An important advantage of this last definition of discrepancy, in addition to those already
mentioned, is that it can be accurately estimated from finite samples drawn from
$D'^1$ and $D^1$ when the loss is bounded and the Rademacher complexity of the family of
functions $L_H = \{x \mapsto L(h'(x), h(x)) : h, h' \in H\}$ is in $O(1/\sqrt{T})$, where $T$ is the
sample size; in particular when $L_H$ has a finite pseudo-dimension [14]. The discrepancy
is by definition symmetric and verifies the triangle inequality for any loss function
$L$. In general, it does not define a distance since we may have $\mathrm{disc}(D, D') = 0$ for
$D' \neq D$. However, in some cases, for example for kernel-based hypothesis sets based
on a Gaussian kernel, the discrepancy has been shown to be a distance [17].
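To make the estimation from unlabeled samples concrete, here is a minimal sketch for one special case, chosen purely for illustration and not taken from the paper: the zero-one loss with threshold hypotheses on the real line. Two thresholds disagree exactly on the interval between them, so the empirical discrepancy reduces to the largest difference in mass that the two samples assign to an interval, which equals the range of the gap between the two empirical distribution functions. The function name is hypothetical.

```python
def empirical_threshold_discrepancy(sample_p, sample_q):
    """Empirical disc(P, Q) for 1-D threshold hypotheses under the zero-one loss.

    Assumption of this sketch: hypotheses are thresholds on the real line,
    so disc is the maximum over intervals of |P_hat(I) - Q_hat(I)|.
    """
    n_p, n_q = len(sample_p), len(sample_q)
    # Each point moves the gap G = F_P - F_Q between the empirical CDFs.
    events = sorted([(x, 1.0 / n_p) for x in sample_p] +
                    [(x, -1.0 / n_q) for x in sample_q])
    gap = 0.0
    highest = lowest = 0.0  # value of G to the left of every point
    i = 0
    while i < len(events):
        x = events[i][0]
        # Apply all jumps located at the same value before recording G.
        while i < len(events) and events[i][0] == x:
            gap += events[i][1]
            i += 1
        highest = max(highest, gap)
        lowest = min(lowest, gap)
    # max over pairs of thresholds of |G(b) - G(a)|
    return highest - lowest
```

For richer hypothesis sets the supremum in (6) generally requires solving an optimization problem rather than a sweep of this kind.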

We will present our learning guarantees in terms of the $\mathcal{Y}$-discrepancy $\mathrm{disc}_{\mathcal{Y}}$, which
is the most general definition, since guarantees in terms of the discrepancy $\mathrm{disc}$ can be
straightforwardly derived from them. The advantage of the latter bounds is the fact that
the discrepancy can be estimated in that case from unlabeled finite samples.

We will consider two different scenarios for the analysis of learning with drifting 
distributions: the drifting PAC scenario and the drifting tracking scenario. 

The drifting PAC scenario is a natural extension of the PAC scenario, where the objective
is to select a hypothesis $h$ out of a hypothesis set $H$ with a small expected loss
according to the distribution $D_{T+1}$ after receiving a sample of $T \ge 1$ instances drawn
from the product distribution $\bigotimes_{t=1}^{T} D_t$. Thus, the focus in this scenario is the performance
of the hypothesis $h$ with respect to the environment distribution after receiving
the training sample.

The drifting tracking scenario we consider is based on the scenario originally intro- 
duced by [8] for the zero-one loss and is used to measure the performance of an algo- 
rithm A (as opposed to any hypothesis h). In that learning model, the performance of 
an algorithm is determined based on its average predictions at each time for a sequence 
of distributions. We will generalize its definition by using the notion of discrepancy 
and extending it to other loss functions. The following definitions are the key concepts 
defining this model. 

Definition 4. For any sample $S = (x_t, y_t)_{t=1}^{T}$ of size $T$, we denote by $h_{T-1} \in H$ the
hypothesis returned by an algorithm A after receiving the first $T - 1$ examples and by
$M_T$ its loss or mistake on $x_T$: $M_T = L(h_{T-1}(x_T), y_T)$. For a product distribution
$D = \bigotimes_{t=1}^{T} D_t$ on $(\mathcal{X} \times \mathcal{Y})^T$ we denote by $M_T(D)$ the expected mistake of A:

$$M_T(D) = \mathbb{E}_{S \sim D}[M_T] = \mathbb{E}_{S \sim D}[L(h_{T-1}(x_T), y_T)].$$

Definition 5. Let $\Delta > 0$ and let $\widetilde{M}_T$ be the supremum of $M_T(D)$ over all distribution
sequences $D = (D_t)$, with $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) \le \Delta$. Algorithm A is said to $(\Delta, \epsilon)$-track
$H$ if there exists $t_0$ such that for $T > t_0$ we have $\widetilde{M}_T \le \inf_{h \in H} \mathcal{L}_{D_T}(h) + \epsilon$.

An analysis of the tracking scenario with the $L_1$ distance used to measure the divergence
of distributions instead of the discrepancy was carried out by Long [9] and Barve
and Long [10], including both upper and lower bounds for $\widetilde{M}_T$ in terms of $\Delta$. Their
analysis makes use of an algorithm very similar to empirical risk minimization, which 
we will also use in our theoretical analysis of both scenarios. 



3 Drifting PAC scenario 

In this section, we present guarantees for the drifting PAC scenario in terms of the
discrepancies of $D_t$ and $D_{T+1}$, $t \in [1, T]$, and the Rademacher complexity of the
hypothesis set. We start with a generalization bound in this scenario and then present a
bound for the agnostic learning setting.

Let us emphasize that learning bounds in the drifting scenario should of course not 
be expected to converge to zero as a function of the sample size but depend instead on 
the divergence between distributions. 

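Theorem 1. Assume that the loss function $L$ is bounded by $M$. Let $D_1, \ldots, D_{T+1}$ be
a sequence of distributions and let $H_L = \{(x, y) \mapsto L(h(x), y) : h \in H\}$. Then, for
any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:

$$\mathcal{L}_{D_{T+1}}(h) \le \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) + 2\mathfrak{R}_T(H_L) + \frac{1}{T} \sum_{t=1}^{T} \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + M \sqrt{\frac{\log \frac{1}{\delta}}{2T}}.$$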
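The right-hand side of Theorem 1 is a sum of four explicit terms, which the following sketch evaluates numerically. It is only a calculator for the bound under the assumption that the Rademacher complexity and the discrepancies $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$ are supplied by the caller; the function and parameter names are hypothetical.

```python
import math

def drifting_pac_bound(empirical_loss, rademacher, discrepancies, loss_bound, delta):
    """Evaluate the right-hand side of the bound of Theorem 1.

    discrepancies[t] stands for disc_Y(D_t, D_{T+1}); all inputs are assumed
    to be known or estimated elsewhere.
    """
    T = len(discrepancies)
    # Average discrepancy between each training distribution and D_{T+1}.
    avg_disc = sum(discrepancies) / T
    # Confidence term M * sqrt(log(1/delta) / (2T)).
    confidence = loss_bound * math.sqrt(math.log(1.0 / delta) / (2.0 * T))
    return empirical_loss + 2.0 * rademacher + avg_disc + confidence
```

With all discrepancies equal to zero the expression reduces to the usual Rademacher complexity bound for a fixed distribution.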



Proof. We denote by $D$ the product distribution $\bigotimes_{t=1}^{T} D_t$. Let $\Phi$ be the function defined
over any sample $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathcal{X} \times \mathcal{Y})^T$ by

$$\Phi(S) = \sup_{h \in H} \Big(\mathcal{L}_{D_{T+1}}(h) - \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t)\Big).$$

Let $S$ and $S'$ be two samples differing by one labeled point, say $(x_t, y_t)$ in $S$ and $(x'_t, y'_t)$
in $S'$, then:

$$\Phi(S') - \Phi(S) \le \sup_{h \in H} \frac{1}{T} \big[L(h(x_t), y_t) - L(h(x'_t), y'_t)\big] \le \frac{M}{T}.$$

Thus, by McDiarmid's inequality (see footnote 3), the following holds:

$$\Pr_{S \sim D}\Big[\Phi(S) - \mathbb{E}_{S \sim D}[\Phi(S)] > \epsilon\Big] \le \exp(-2T\epsilon^2 / M^2).$$



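We now bound $\mathbb{E}_{S \sim D}[\Phi(S)]$ by first rewriting it, as follows:

$$\begin{aligned}
\mathbb{E}[\Phi(S)] &= \mathbb{E}\Big[\sup_{h \in H} \mathcal{L}_{D_{T+1}}(h) - \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{D_t}(h) + \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{D_t}(h) - \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t)\Big] \\
&\le \mathbb{E}\Big[\sup_{h \in H} \mathcal{L}_{D_{T+1}}(h) - \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{D_t}(h)\Big] + \mathbb{E}\Big[\sup_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_{D_t}(h) - \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t)\Big] \\
&\le \mathbb{E}\Big[\frac{1}{T} \sum_{t=1}^{T} \sup_{h \in H} \big(\mathcal{L}_{D_{T+1}}(h) - \mathcal{L}_{D_t}(h)\big) + \sup_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \big(\mathcal{L}_{D_t}(h) - L(h(x_t), y_t)\big)\Big] \\
&\le \frac{1}{T} \sum_{t=1}^{T} \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + \mathbb{E}\Big[\sup_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \big(\mathcal{L}_{D_t}(h) - L(h(x_t), y_t)\big)\Big].
\end{aligned}$$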

3 Note that McDiarmid's inequality does not require points to be drawn according to the same 
distribution but only that they would be drawn independently. 



It is not hard to see, using a symmetrization argument as in the non-sequential case, that
the second term can be bounded by $2\mathfrak{R}_T(H_L)$. □

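For many commonly used loss functions, the empirical Rademacher complexity
$\widehat{\mathfrak{R}}_T(H_L)$ can be upper bounded in terms of that of the function class $H$. In particular,
for the zero-one loss it is known that $\widehat{\mathfrak{R}}_T(H_L) = \widehat{\mathfrak{R}}_T(H)/2$ and when $L$ is the $L_q$
loss for some $q \ge 1$, that is $L(y, y') = |y' - y|^q$ for all $y, y' \in \mathcal{Y}$, then
$\widehat{\mathfrak{R}}_T(H_L) \le qM^{q-1} \widehat{\mathfrak{R}}_T(H)$. Indeed, since $x \mapsto |x|^q$ is $qM^{q-1}$-Lipschitz over $[-M, +M]$, by
Talagrand's contraction lemma, $\widehat{\mathfrak{R}}_T(H_L)$ is bounded by $qM^{q-1} \widehat{\mathfrak{R}}_T(G)$ with
$G = \{(x, y) \mapsto (h(x) - y) : h \in H\}$. Furthermore, $\widehat{\mathfrak{R}}_T(G)$ can be analyzed as follows:

$$\widehat{\mathfrak{R}}_T(G) = \frac{1}{T} \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \sum_{t=1}^{T} \sigma_t (h(x_t) - y_t)\Big] = \frac{1}{T} \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \sum_{t=1}^{T} \sigma_t h(x_t)\Big] + \frac{1}{T} \mathbb{E}_{\sigma}\Big[\sum_{t=1}^{T} -\sigma_t y_t\Big] = \widehat{\mathfrak{R}}_T(H),$$

since $\mathbb{E}_{\sigma}\big[\sum_{t=1}^{T} -\sigma_t y_t\big] = 0$. Taking the expectation of both sides yields a similar
inequality for Rademacher complexities. Thus, in the statement of the previous theorem,
$\mathfrak{R}_T(H_L)$ can be replaced with $qM^{q-1} \mathfrak{R}_T(H)$ when $L$ is the $L_q$ loss.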

Observe that the bound of Theorem 1 is tight as a function of the divergence measure
(discrepancy) we are using. Consider for example the case where $D_1 = \ldots = D_T$, then
a standard Rademacher complexity generalization bound holds for all $h \in H$:

$$\mathcal{L}_{D_T}(h) \le \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) + 2\mathfrak{R}_T(H_L) + O(1/\sqrt{T}).$$

Now, our generalization bound for $\mathcal{L}_{D_{T+1}}(h)$ includes only the additive term
$\mathrm{disc}_{\mathcal{Y}}(D_T, D_{T+1})$, but by definition of the discrepancy, for any $\epsilon > 0$, there exists
$h \in H$ such that the inequality $|\mathcal{L}_{D_{T+1}}(h) - \mathcal{L}_{D_T}(h)| > \mathrm{disc}_{\mathcal{Y}}(D_T, D_{T+1}) - \epsilon$ holds.

Next, we present PAC learning bounds for empirical risk minimization. Let $h^*_T$ be
a best-in-class hypothesis in $H$, that is one with the best expected loss. By a similar
reasoning as in theorem 1, we can show that with probability $1 - \frac{\delta}{2}$ we have

$$\frac{1}{T} \sum_{t=1}^{T} L(h^*_T(x_t), y_t) \le \mathcal{L}_{D_{T+1}}(h^*_T) + 2\mathfrak{R}_T(H_L) + \frac{1}{T} \sum_{t=1}^{T} \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2T}}.$$

Let $h_T$ be a hypothesis returned by empirical risk minimization (ERM). Combining this
inequality with the bound of theorem 1 while using the definition of $h_T$ and using the
union bound, we obtain that with probability $1 - \delta$ the following holds:

$$\mathcal{L}_{D_{T+1}}(h_T) - \mathcal{L}_{D_{T+1}}(h^*_T) \le 4\mathfrak{R}_T(H_L) + \frac{2}{T} \sum_{t=1}^{T} \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2T}}. \tag{7}$$



This learning bound indicates a trade-off: larger values of the sample size T guarantee 
smaller first and third terms; however, as T increases, the average discrepancy term is
likely to grow as well, thereby making learning increasingly challenging. This suggests
an algorithm similar to empirical risk minimization but limited to the last m examples 
instead of the whole sample with m < T. This algorithm was previously used in [10] 
for the study of the tracking scenario. We will use it here to prove several theoretical 
guarantees in the PAC learning model. 

Proposition 1. Let $\Delta > 0$. Assume that $(D_t)_{t \ge 0}$ is a sequence of distributions such
that $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) \le \Delta$ for all $t \ge 0$. Fix $m \ge 1$ and let $h_T$ denote the hypothesis
returned by the algorithm A that minimizes $\sum_{t=T-m}^{T} L(h(x_t), y_t)$ after receiving $T > m$
examples. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following learning
bound holds:

$$\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) \le 4\mathfrak{R}_m(H_L) + (m + 1)\Delta + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{8}$$

Proof. The proof is straightforward. Notice that the algorithm discards the first $T - m$
examples and considers exactly $m$ instances. Thus, as in inequality 7, we have:

$$\mathcal{L}_{D_{T+1}}(h_T) - \mathcal{L}_{D_{T+1}}(h^*_T) \le 4\mathfrak{R}_m(H_L) + \frac{2}{m} \sum_{t=T-m}^{T} \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Now, we can use the triangle inequality to bound $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$ by $(T + 1 - t)\Delta$.
Thus, the sum of the discrepancy terms can be bounded by $(m + 1)\Delta$. □
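The windowed algorithm analyzed above can be sketched in a few lines for a finite hypothesis set. This is an illustrative sketch only: the finite set, the helper `make_threshold`, and the zero-one loss are assumptions introduced here, not part of the paper's setting, and the window is taken to be the last `m` examples.

```python
def zero_one_loss(prediction, label):
    """Zero-one loss, used here only as an example of a bounded loss."""
    return int(prediction != label)

def make_threshold(theta):
    """Hypothetical helper: threshold classifier x -> 1 if x > theta else 0."""
    return lambda x: 1 if x > theta else 0

def truncated_erm(hypotheses, sample, m, loss):
    """Return the hypothesis with the smallest total loss on the last m examples.

    hypotheses: finite list of callables (an assumption of this sketch).
    sample: list of (x, y) pairs in the order they were received; m >= 1.
    """
    window = sample[-m:]  # discard everything but the most recent m examples
    return min(hypotheses,
               key=lambda h: sum(loss(h(x), y) for x, y in window))
```

On a sample whose labeling changes near the end, a short window recovers the recent concept, whereas minimizing over the whole sample still favors the older one.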

To obtain the best learning guarantee, we can select $m$ to minimize the bound just
presented. This requires the expression of the Rademacher complexity in terms of $m$. The
following is the result obtained when using a VC-dimension upper bound of $O(\sqrt{d/m})$
for the Rademacher complexity.

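Corollary 1. Fix $\Delta > 0$. Let $H$ be a hypothesis set with VC-dimension $d$ such that
for all $m \ge 1$, $\mathfrak{R}_m(H_L) \le \frac{C}{4} \sqrt{\frac{d}{m}}$ for some constant $C > 0$. Assume that $(D_t)_{t \ge 0}$
is a sequence of distributions such that $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) \le \Delta$ for all $t \ge 0$. Then,
there exists an algorithm A such that for any $\delta > 0$, the hypothesis $h_T$ it returns
after receiving $T > \big[\frac{C + C'}{2}\big]^{2/3} \big(\frac{d}{\Delta^2}\big)^{1/3}$ instances, where $C' = 2M \sqrt{\frac{\log \frac{2}{\delta}}{2}}$, satisfies the
following with probability at least $1 - \delta$:

$$\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) \le 3 \Big[\frac{C + C'}{2}\Big]^{2/3} (d\Delta)^{1/3} + \Delta. \tag{9}$$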

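Proof. Fix $\delta > 0$. Replacing $\mathfrak{R}_m(H_L)$ by the upper bound $\frac{C}{4} \sqrt{\frac{d}{m}}$ in (8) yields

$$\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) \le (C + C') \sqrt{\frac{d}{m}} + (m + 1)\Delta.$$

Choosing $m = \big(\frac{C + C'}{2}\big)^{2/3} \big(\frac{d}{\Delta^2}\big)^{1/3}$ to minimize the right-hand side gives exactly (9). □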
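The choice of window size can be checked numerically. The sketch below writes the right-hand side as $K\sqrt{d/m} + (m+1)\Delta$, with $K$ standing for the combined constant $C + C'$, and compares its value at the real-valued minimizer $m = (K/2)^{2/3}(d/\Delta^2)^{1/3}$ with the closed form $3(K/2)^{2/3}(d\Delta)^{1/3} + \Delta$. The function names and the numeric values in the usage note are illustrative assumptions.

```python
import math

def window_bound(m, K, d, drift):
    """Value of K * sqrt(d / m) + (m + 1) * drift for a window of size m."""
    return K * math.sqrt(d / m) + (m + 1) * drift

def optimal_window(K, d, drift):
    """Real-valued minimizer (K / 2)^(2/3) * (d / drift^2)^(1/3) of window_bound."""
    return (K / 2.0) ** (2.0 / 3.0) * (d / drift ** 2) ** (1.0 / 3.0)
```

For instance, with `K = 3.0`, `d = 10` and `drift = 0.001`, the minimizer is a window of roughly 280 examples; in practice the window would be rounded to an integer, which changes the constant only slightly.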



When $H$ has finite VC-dimension $d$, it is known that $\mathfrak{R}_m(H_L)$ can be bounded by
$C\sqrt{d/m}$ for some constant $C > 0$, by using a chaining argument [20, 21, 22]. Thus,
the assumption of the corollary holds for many loss functions $L$, when $H$ has finite
VC-dimension.

4 Drifting Tracking scenario 

In this section, we present a simpler proof of the bounds given by [9] for the agnostic
case demonstrating that using the discrepancy as a measure of the divergence between
distributions leads to tighter and more informative bounds than using the $L_1$ distance.

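Proposition 2. Let $\Delta > 0$ and let $(D_t)_{t \ge 0}$ be a sequence of distributions such that
$\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) \le \Delta$ for all $t \ge 0$. Let $m \ge 1$ and let $h_T$ be as in proposition 1. Then,

$$\mathbb{E}[M_{T+1}] - \inf_{h} \mathcal{L}_{D_{T+1}}(h) \le 4\mathfrak{R}_m(H_L) + 2M \sqrt{\frac{\pi}{2m}} + (m + 1)\Delta. \tag{10}$$

Proof. Let $D = \bigotimes_{t=1}^{T+1} D_t$ and $D' = \bigotimes_{t=1}^{T} D_t$. By Fubini's theorem we can write:

$$\mathbb{E}_{D}[M_{T+1}] - \inf_{h} \mathcal{L}_{D_{T+1}}(h) = \mathbb{E}_{D'}\Big[\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h} \mathcal{L}_{D_{T+1}}(h)\Big]. \tag{11}$$

Now, let $\beta = 4\mathfrak{R}_m(H_L) + (m + 1)\Delta + 2M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}$ and let $\phi(\beta)$ denote the
corresponding value of $\delta$; then, by (8), for $\beta > 4\mathfrak{R}_m(H_L) + (m + 1)\Delta$, the following holds:

$$\Pr_{D'}\Big[\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h} \mathcal{L}_{D_{T+1}}(h) > \beta\Big] \le \phi(\beta).$$

Thus, the expectation on the right-hand side of (11) can be bounded as follows:

$$\mathbb{E}_{D'}\Big[\mathcal{L}_{D_{T+1}}(h_T) - \inf_{h} \mathcal{L}_{D_{T+1}}(h)\Big] \le 4\mathfrak{R}_m(H_L) + (m + 1)\Delta + \int_{4\mathfrak{R}_m(H_L) + (m+1)\Delta}^{\infty} \phi(\beta) \, d\beta.$$

The last integral can be rewritten as $2M \int_{0}^{2} \sqrt{\frac{1}{2m} \log \frac{2}{\delta}} \, d\delta = 2M \sqrt{\frac{\pi}{2m}}$ using the change of
variable $\delta = \phi(\beta)$. This concludes the proof. □

The following corollary can be shown using the same proof as that of corollary 1.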

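Corollary 2. Fix $\Delta > 0$. Let $H$ be a hypothesis set with VC-dimension $d$ such that
for all $m \ge 1$, $\mathfrak{R}_m(H_L) \le \frac{C}{4} \sqrt{\frac{d}{m}}$. Let $(D_t)_{t \ge 0}$ be a sequence of distributions over
$\mathcal{X} \times \mathcal{Y}$ such that $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) \le \Delta$. Let $C' = 2M \sqrt{\frac{\pi}{2}}$ and $K = 3 \big[\frac{C + C'}{2}\big]^{2/3}$.
Then, for $T > \big[\frac{C + C'}{2}\big]^{2/3} \big(\frac{d}{\Delta^2}\big)^{1/3}$, the following inequality holds:

$$\mathbb{E}[M_{T+1}] - \inf_{h} \mathcal{L}_{D_{T+1}}(h) \le K (d\Delta)^{1/3} + \Delta.$$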



Fig. 1. Figure depicting the difference between the $L_1$ distance and the discrepancy. In the left
figure, the $L_1$ distance is given by twice the area of the green rectangle. In the right figure,
$P(h(x) \neq h'(x))$ is equal to the area of the blue rectangle and $Q(h(x) \neq h'(x))$ is the area of
the red rectangle. The two areas are equal, thus $\mathrm{disc}(P, Q) = 0$.

In terms of definition 5, this corollary shows that algorithm A $(\Delta, K(d\Delta)^{1/3} + \Delta)$-tracks
$H$. This result is similar to a result of [9] which states that given $\epsilon > 0$ if
$\Delta = O(\epsilon^3/d)$ then A $(\Delta, \epsilon)$-tracks $H$. However, in [9], $\Delta$ is an upper bound on the $L_1$ distance
and not the discrepancy. Our result thus provides a tighter and more general guarantee
than that of [9], the latter because this result is applicable to any loss function and
not only the zero-one loss, the former because our bound is based on the Rademacher
complexity instead of the VC-dimension and more importantly because it is based on
the discrepancy, which is a finer measure of the divergence between distributions than
the $L_1$ distance. Indeed, for any $t \in [1, T]$,

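$$\begin{aligned}
\mathrm{disc}_{\mathcal{Y}}(D_t, D_{t+1}) &= \sup_{h \in H} \big|\mathcal{L}_{D_t}(h) - \mathcal{L}_{D_{t+1}}(h)\big| \\
&= \sup_{h \in H} \Big|\sum_{x, y} \big(D_t(x, y) - D_{t+1}(x, y)\big) L(h(x), y)\Big| \\
&\le M \sum_{x, y} \big|D_t(x, y) - D_{t+1}(x, y)\big| = M L_1(D_t, D_{t+1}).
\end{aligned}$$

Furthermore, when the target function $f$ is in $H$, then the $\mathcal{Y}$-discrepancies can be
bounded by the discrepancies $\mathrm{disc}(D_t, D_{t+1})$, which, unlike the $L_1$ distance, can be
accurately estimated from finite samples.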

It is important to emphasize that even though our analysis was based on a particular 
algorithm, that of "truncated" empirical risk minimization, the bounds obtained here 
cannot be improved upon in the general scenario of drifting distributions, as shown by 
[10] in the case of binary classification. 

We now illustrate the difference between the guarantees we present and those based
on the $L_1$ distance by presenting a simple example for the zero-one loss where the $L_1$
distance can be made arbitrarily close to 2 while the discrepancy is 0. In that case, our
bounds state that the learning problem is as favorable as in the absence of any drifting,
while a learning bound with the $L_1$ distance would be uninformative. Consider
measures $P$ and $Q$ in $\mathbb{R}^2$, where $P$ is uniform in the rectangle $R_1$ defined by the vertices
$(-1, R)$, $(1, R)$, $(1, -1)$, $(-1, -1)$ and $Q$ is uniform in the rectangle $R_2$ spanned
by $(-1, -R)$, $(1, -R)$, $(-1, 1)$, $(1, 1)$. The measures are depicted in figure 1. The $L_1$
distance of these probability measures is given by twice the difference of measure in
the green rectangle, i.e., $|P - Q| = 2\frac{R - 1}{R + 1}$; this distance goes to 2 as $R \to \infty$. On
the other hand, consider the zero-one loss and the hypothesis set consisting of threshold
functions on the first coordinate, i.e. $h(x, y) = 1$ iff $h < x$. For any two hypotheses
$h < h'$ the area of disagreement of these two hypotheses is given by the stripe
$S = \{x : h < x < h'\}$. It is straightforward to see that $P(S) = P(S \cap R_1) = (h' - h)/2$
and also $Q(S) = Q(S \cap R_2) = (h' - h)/2$. Since this is true for any pair of hypotheses,
we conclude that $\mathrm{disc}(P, Q) = 0$. This example shows that the learning bounds we
presented can be dramatically more favorable than those given in the past using the $L_1$
distance.
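The example can be checked by simulation. The sketch below samples from the two rectangles for a large value of $R$ and compares the mass that each measure gives to one stripe of disagreement; the sample size, the seed, the value of `R`, and the particular thresholds are arbitrary choices made here for illustration, and the function names are hypothetical.

```python
import random

def l1_distance(R):
    """Closed-form L1 distance 2 (R - 1) / (R + 1) between P and Q, for R > 1."""
    return 2.0 * (R - 1.0) / (R + 1.0)

def sample_rectangle(n, y_low, y_high, rng):
    """Draw n points uniformly from the rectangle [-1, 1] x [y_low, y_high]."""
    return [(rng.uniform(-1.0, 1.0), rng.uniform(y_low, y_high)) for _ in range(n)]

def stripe_mass(points, h, h_prime):
    """Fraction of points whose first coordinate lies between the two thresholds."""
    return sum(1 for x, _ in points if h < x <= h_prime) / len(points)

rng = random.Random(0)
R = 1000.0
P = sample_rectangle(20000, -1.0, R, rng)   # uniform on R_1
Q = sample_rectangle(20000, -R, 1.0, rng)   # uniform on R_2
# Disagreement mass of thresholds -0.3 and 0.4 under each measure.
gap = abs(stripe_mass(P, -0.3, 0.4) - stripe_mass(Q, -0.3, 0.4))
```

Both stripe masses concentrate around $(0.4 - (-0.3))/2 = 0.35$, so `gap` is small up to sampling noise, while `l1_distance(R)` is close to 2.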

Although this may be viewed as a trivial illustrative example, the discrepancy and
the $L_1$ distance can greatly differ in more complex but realistic cases.

5 On-line to batch conversion 

In this section, we present learning guarantees for drifting distributions in terms of the
regret of an on-line learning algorithm A. The algorithm processes a sample $(x_t)_{t \ge 1}$
sequentially by receiving a sample point $x_t \in \mathcal{X}$, generating a hypothesis $h_t$, and
incurring a loss $L(h_t(x_t), y_t)$, with $y_t \in \mathcal{Y}$. We denote by $R_T$ the regret of algorithm A
after processing $T \ge 1$ sample points:

$$R_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \inf_{h \in H} \sum_{t=1}^{T} L(h(x_t), y_t).$$

The standard setting of on-line learning assumes an adversarial scenario with no dis- 
tributional assumption. Nevertheless, when the data is generated according to some 
distribution, the hypotheses returned by an on-line algorithm A can be combined to define
a hypothesis with strong learning guarantees in the distributional setting when the
regret $R_T$ is in $O(\sqrt{T})$ (which is attainable by several regret minimization algorithms)
[23, 24]. Here, we extend these results to the drifting scenario and the case of a convex 
combination of the hypotheses generated by the algorithm. The following lemma will 
be needed for the proof of our main result. 

Lemma 1. Let $S = (x_t, y_t)_{t=1}^{T}$ be a sample drawn from the distribution $D = \bigotimes_{t=1}^{T} D_t$
and let $(h_t)_{t=1}^{T}$ be the sequence of hypotheses returned by an on-line algorithm
sequentially processing $S$. Let $\mathbf{w} = (w_1, \ldots, w_T)^\top$ be a vector of non-negative weights
verifying $\sum_{t=1}^{T} w_t = 1$. If the loss function $L$ is bounded by $M$ then, for any $\delta > 0$,
with probability at least $1 - \delta$, each of the following inequalities holds:

$$\sum_{t=1}^{T} w_t \mathcal{L}_{D_{T+1}}(h_t) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{1}{\delta}}$$

$$\sum_{t=1}^{T} w_t L(h_t(x_t), y_t) \le \sum_{t=1}^{T} w_t \mathcal{L}_{D_{T+1}}(h_t) + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{1}{\delta}},$$

where $\bar{\Delta}(\mathbf{w}, T)$ denotes the average discrepancy $\sum_{t=1}^{T} w_t \, \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$.

Proof. Consider the random process: $Z_t = w_t L(h_t(x_t), y_t) - w_t \mathcal{L}_{D_t}(h_t)$ and let $\mathcal{F}_t$
denote the filtration associated to the sample process. We have: $|Z_t| \le M w_t$ and

$$\mathbb{E}_{D}[Z_t \mid \mathcal{F}_{t-1}] = \mathbb{E}_{D}[w_t L(h_t(x_t), y_t) \mid \mathcal{F}_{t-1}] - \mathbb{E}_{D_t}[w_t L(h_t(x_t), y_t)] = 0.$$

The second equality holds because $h_t$ is determined at time $t - 1$ and $x_t, y_t$ are independent
of $\mathcal{F}_{t-1}$. Thus, by Azuma-Hoeffding's inequality, for any $\delta > 0$, with probability
at least $1 - \delta$ the following holds:

$$\sum_{t=1}^{T} w_t \mathcal{L}_{D_t}(h_t) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{1}{\delta}}. \tag{12}$$

By definition of the discrepancy, the following inequality holds for any $t \in [1, T]$:

$$\mathcal{L}_{D_{T+1}}(h_t) \le \mathcal{L}_{D_t}(h_t) + \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}).$$

Multiplying by $w_t$, summing up these inequalities and using (12) to bound $\sum_{t=1}^{T} w_t \mathcal{L}_{D_t}(h_t)$ proves the
first statement. The second statement can be proven in a similar way. □

The following theorem is the main result of this section. 

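Theorem 2. Assume that $L$ is bounded by $M$ and convex with respect to its first argument.
Let $h_1, \ldots, h_T$ be the hypotheses returned by A when sequentially processing
$(x_t, y_t)_{t=1}^{T}$ and let $h$ be the hypothesis defined by $h = \sum_{t=1}^{T} w_t h_t$, where $w_1, \ldots, w_T$
are arbitrary non-negative weights verifying $\sum_{t=1}^{T} w_t = 1$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, $h$ satisfies each of the following learning guarantees:

$$\mathcal{L}_{D_{T+1}}(h) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{1}{\delta}}$$

$$\mathcal{L}_{D_{T+1}}(h) \le \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) + \frac{R_T}{T} + 2\bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w} - \mathbf{u}_0\|_1 + 2M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{2}{\delta}},$$

where $\mathbf{w} = (w_1, \ldots, w_T)^\top$, $\bar{\Delta}(\mathbf{w}, T) = \sum_{t=1}^{T} w_t \, \mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$, and $\mathbf{u}_0 \in \mathbb{R}^T$ is
the vector with all its components equal to $1/T$.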



Observe that when all weights are equal to $\frac{1}{T}$, the result we obtain is similar to the
learning guarantee obtained in theorem 1 when the Rademacher complexity of $H_L$ is
$O(\frac{1}{\sqrt{T}})$. Also, if the learning scenario is i.i.d., then the first sum of the bound vanishes
and it can be seen straightforwardly that to minimize the RHS of the inequality we
need to set $w_t = \frac{1}{T}$, which results in the known i.i.d. guarantees for on-line to batch
conversion [23, 24].
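The trade-off between the weighted discrepancy and the norm $\|\mathbf{w}\|_2$ in the first guarantee of Theorem 2 can be explored numerically. The following sketch simply evaluates that right-hand side for a given weight vector, assuming the per-round losses and the discrepancies $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$ are available; the function and parameter names are hypothetical.

```python
import math

def weighted_bound(weights, losses, discrepancies, loss_bound, delta):
    """Right-hand side of the first guarantee of Theorem 2.

    weights[t] = w_t, losses[t] = L(h_t(x_t), y_t), and discrepancies[t]
    stands for disc_Y(D_t, D_{T+1}); all are assumed given.
    """
    weighted_loss = sum(w * l for w, l in zip(weights, losses))
    avg_disc = sum(w * d for w, d in zip(weights, discrepancies))
    norm2 = math.sqrt(sum(w * w for w in weights))
    return (weighted_loss + avg_disc
            + loss_bound * norm2 * math.sqrt(2.0 * math.log(1.0 / delta)))
```

When older distributions are far from $D_{T+1}$, concentrating the weights on recent hypotheses lowers the discrepancy term at the price of a larger $\|\mathbf{w}\|_2$, and the sum of the two can end up smaller than for uniform weights.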

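Proof. Since $L$ is convex with respect to its first argument, by Jensen's inequality, we
have $\mathcal{L}_{D_{T+1}}\big(\sum_{t=1}^{T} w_t h_t\big) \le \sum_{t=1}^{T} w_t \mathcal{L}_{D_{T+1}}(h_t)$. Thus, by Lemma 1, for any $\delta > 0$,
the following holds with probability at least $1 - \delta$:

$$\mathcal{L}_{D_{T+1}}\Big(\sum_{t=1}^{T} w_t h_t\Big) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{1}{\delta}}. \tag{13}$$

This proves the first statement of the theorem. To prove the second claim, we will
bound the empirical error in terms of the regret. For any $h^* \in H$, we can write
using $\inf_{h \in H} \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) \le \frac{1}{T} \sum_{t=1}^{T} L(h^*(x_t), y_t)$:

$$\begin{aligned}
&\sum_{t=1}^{T} w_t L(h_t(x_t), y_t) - \sum_{t=1}^{T} w_t L(h^*(x_t), y_t) \\
&\quad = \sum_{t=1}^{T} \Big(w_t - \frac{1}{T}\Big) \big[L(h_t(x_t), y_t) - L(h^*(x_t), y_t)\big] + \frac{1}{T} \sum_{t=1}^{T} \big[L(h_t(x_t), y_t) - L(h^*(x_t), y_t)\big] \\
&\quad \le M \|\mathbf{w} - \mathbf{u}_0\|_1 + \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) - \inf_{h \in H} \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) \\
&\quad \le M \|\mathbf{w} - \mathbf{u}_0\|_1 + \frac{R_T}{T}.
\end{aligned}$$

Now, by definition of the infimum, for any $\epsilon > 0$, there exists $h^* \in H$ such that
$\mathcal{L}_{D_{T+1}}(h^*) \le \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) + \epsilon$. For that choice of $h^*$, in view of (13), with
probability at least $1 - \delta/2$, the following holds:

$$\mathcal{L}_{D_{T+1}}(h) \le \sum_{t=1}^{T} w_t L(h^*(x_t), y_t) + M \|\mathbf{w} - \mathbf{u}_0\|_1 + \frac{R_T}{T} + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{2}{\delta}}.$$

By the second statement of Lemma 1, for any $\delta > 0$, with probability at least $1 - \delta/2$,

$$\sum_{t=1}^{T} w_t L(h^*(x_t), y_t) \le \mathcal{L}_{D_{T+1}}(h^*) + \bar{\Delta}(\mathbf{w}, T) + M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{2}{\delta}}.$$

Combining these last two inequalities, by the union bound, with probability at least
$1 - \delta$, the following holds with $B(\mathbf{w}, \delta) = M \|\mathbf{w} - \mathbf{u}_0\|_1 + \frac{R_T}{T} + 2M \|\mathbf{w}\|_2 \sqrt{2 \log \frac{2}{\delta}}$:

$$\begin{aligned}
\mathcal{L}_{D_{T+1}}(h) &\le \mathcal{L}_{D_{T+1}}(h^*) + 2\bar{\Delta}(\mathbf{w}, T) + B(\mathbf{w}, \delta) \\
&\le \inf_{h \in H} \mathcal{L}_{D_{T+1}}(h) + \epsilon + 2\bar{\Delta}(\mathbf{w}, T) + B(\mathbf{w}, \delta).
\end{aligned}$$

The last inequality holds for all $\epsilon > 0$, therefore also for $\epsilon = 0$ by taking the limit. □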



6 Algorithm 



The results of the previous section suggest a natural algorithm based on the values of the 
discrepancy between distributions. Let (h t )J =1 be the sequence of hypotheses generated 
by an on-line algorithm. Theorem 2 provides a learning guarantee for any convex com- 
bination of these hypotheses. The convex combination based on the weight vector w 
minimizing the bound of Theorem 2 benefits from the most favorable guarantee. This 
leads to an algorithm for determining w based on the following convex optimization 
problem: 

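$$\min_{\mathbf{w}} \; \lambda \|\mathbf{w}\|_2^2 + \sum_{t=1}^{T} w_t \big(\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) + L(h_t(x_t), y_t)\big) \tag{14}$$

$$\text{subject to: } \Big(\sum_{t=1}^{T} w_t = 1\Big) \wedge \big(\forall t \in [1, T], \; w_t \ge 0\big),$$

where $\lambda \ge 0$ is a regularization parameter. This is a standard QP problem that can be
efficiently solved using a variety of techniques and available software.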
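For $\lambda > 0$, one simple way to solve a problem of the form (14) without a general QP solver is the following observation, which is a standard fact about this type of objective rather than a method stated in the paper: writing $c_t$ for the coefficient of $w_t$, the objective equals $\lambda \|\mathbf{w} + \mathbf{c}/(2\lambda)\|_2^2$ up to a constant, so the minimizer is the Euclidean projection of $-\mathbf{c}/(2\lambda)$ onto the simplex. The sketch below uses the classical sort-based projection; the function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_t >= 0, sum_t w_t = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        # Keep the shift given by the largest k whose entry stays positive.
        if value - candidate > 0:
            threshold = candidate
    return [max(x - threshold, 0.0) for x in v]

def optimal_weights(costs, lam):
    """Minimize lam * ||w||_2^2 + sum_t w_t * costs[t] over the simplex, lam > 0.

    costs[t] plays the role of disc_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t).
    """
    return project_to_simplex([-c / (2.0 * lam) for c in costs])
```

Small values of `lam` push the weights toward the hypotheses with the smallest costs, while large values spread them toward the uniform vector.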

In practice, the discrepancy values $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1})$ are not available since they
require labeled samples. But, in the deterministic scenario where the labeling function
$f$ is in $H$, we have $\mathrm{disc}_{\mathcal{Y}}(D_t, D_{T+1}) \le \mathrm{disc}(D_t, D_{T+1})$. Thus, the discrepancy values
$\mathrm{disc}(D_t, D_{T+1})$ can be used instead in our learning bounds and in the optimization
(14). This also holds approximately when $f$ is not in $H$ but is close to some $h \in H$.

As shown in [14], given two (unlabeled) samples of size $n$ from $D_t$ and $D_{T+1}$,
the discrepancy $\mathrm{disc}(D_t, D_{T+1})$ can be estimated within $O(1/\sqrt{n})$, when $\mathfrak{R}_n(H_L) =
O(1/\sqrt{n})$. In many realistic settings, for tasks such as spam filtering, the distribution $D_t$
does not change within a day. This gives us the opportunity to collect an independent
unlabeled sample of size $n$ from each distribution $D_t$. If we choose $n \gg T$, by the
union bound, with high probability, all of our estimated discrepancies will be within
$O(1/\sqrt{T})$ of their exact counterparts $\mathrm{disc}(D_t, D_{T+1})$.

Additionally, in many cases, the distributions D t remain unchanged over some 
longer periods (cycles) which may be known to us. This in fact typically holds for 
some tasks such as spam filtering, political sentiment analysis, some financial market 
prediction problems, and other problems. For example, in the absence of any major 
political event such as a debate, speech, or a prominent measure, we can expect the 
political sentiment to remain stable. In such scenarios, it should be even easier to col- 
lect an unlabeled sample from each distribution. More crucially, we do not need then to 
estimate the discrepancy for all $t \in [1, T]$ but only once for each cycle.

6.1 Experiments 

Here, we report the results of preliminary experiments demonstrating the performance
of our algorithm. We tested our algorithm on synthetic data in a regression setting. The
testing and training data were created as follows: instances were sampled from
two-dimensional Gaussian random variables $\mathcal{N}(\mu_t, 1)$. The objective function at each time
was given by $y_t = w_t \cdot x_t$. The weight vectors $w_t$ and mean vectors $\mu_t$ were selected as
follows: $\mu_t = \mu_{t-1} + U$ and $w_t = R_\theta w_{t-1}$, where $U$ is the uniform random variable
over $[-.1, +.1]^2$ and $R_\theta$ a rotation of magnitude $\theta$ distributed uniformly over $(-1, 1)$.
We used the Widrow-Hoff algorithm [? ] as our base on-line algorithm to determine $h_t$.
After receiving $T$ examples, we tested our final hypothesis on 100 points taken from the
same Gaussian distribution $\mathcal{N}(\mu_{T+1}, 1)$. We ran the experiment 50 times for different
amounts of sample points and took the average performance of our classifier. For these
experiments, we are considering the ideal situation where the discrepancy values are
given.

Fig. 2. Comparison of the performance of three algorithms as a function of the sample size T.
Weighted stands for the algorithm described in this paper, Regular for an algorithm that
averages over all the hypotheses, and Fixed for the algorithm that averages only over the last
100 hypotheses.
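The data-generating process and the base on-line learner described in this section can be sketched as follows. Several details are assumptions made for this sketch because the text does not specify them: the initial mean and weight vector, the unit of the rotation angle (taken here to be radians), the learning rate `eta`, and all function names.

```python
import math
import random

def generate_drifting_sample(T, rng):
    """Return [(x_t, y_t)] with a drifting mean mu_t and a rotating weight vector w_t."""
    mu = [0.0, 0.0]   # assumed starting mean
    w = [1.0, 0.0]    # assumed starting weight vector
    sample = []
    for _ in range(T):
        # mu_t = mu_{t-1} + U, with U uniform over [-0.1, 0.1]^2
        mu = [m + rng.uniform(-0.1, 0.1) for m in mu]
        # w_t = R_theta w_{t-1}, with theta uniform over (-1, 1), assumed in radians
        theta = rng.uniform(-1.0, 1.0)
        c, s = math.cos(theta), math.sin(theta)
        w = [c * w[0] - s * w[1], s * w[0] + c * w[1]]
        # x_t ~ N(mu_t, I) and y_t = w_t . x_t
        x = [rng.gauss(m, 1.0) for m in mu]
        sample.append((x, w[0] * x[0] + w[1] * x[1]))
    return sample

def widrow_hoff(sample, eta=0.01):
    """Return the weight vectors of h_1, ..., h_T, each recorded before seeing (x_t, y_t)."""
    v = [0.0, 0.0]
    hypotheses = []
    for x, y in sample:
        hypotheses.append(list(v))
        # Gradient step on the squared error of the current prediction.
        error = v[0] * x[0] + v[1] * x[1] - y
        v = [v[0] - eta * error * x[0], v[1] - eta * error * x[1]]
    return hypotheses
```

The recorded hypotheses can then be combined with any weight vector on the simplex, for example uniform weights or weights concentrated on the last 100 rounds, to reproduce the kind of comparison reported here.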

We compared the performance of our algorithm with that of the algorithm that (uniformly)
averages all of the hypotheses and with that of the algorithm that averages
only the last 100 hypotheses generated by the base on-line algorithm. Figure 2 shows
the results of our experiments in the first setting. Observe that the error increases with
the sample size. While the analysis of Section 3 could provide an explanation of this 
phenomenon in the case of the uniform averaging algorithm, in principle, it does not 
explain why the error also increases in the case of our algorithm. The answer to this 
can be found in the setting of the experiment. Notice that the Gaussians considered are 
moving their center and that the squared loss grows proportional to the radius of the 
smallest sphere containing the sample. Thus, as the number of points increases, so does 
the maximum value of the loss function in the test set. Nevertheless, our algorithm still 
outperforms the other two algorithms. It is worth noting that the accuracy of our algorithm
can of course change drastically depending on the choice of the on-line algorithm
used.

7 Conclusion 

We presented a theoretical analysis of the problem of learning with drifting distributions 
in the batch setting. Our learning guarantees improve upon previous ones based on the
$L_1$ distance, in some cases substantially, and our proofs are simpler and more concise. These
bounds benefit from the notion of discrepancy which seems to be the natural measure 
of the divergence between distributions in a drifting scenario. This work motivates a 
number of related studies, in particular a discrepancy-based analysis of the scenario in- 
troduced by [13] and further improvements of the algorithm we presented, in particular 
by exploiting the specific on-line learning algorithm used. 